Chapter 2 Bias
2.1 Example: Different level of analysis
The maps show geographic gradients in obesity and diabetes rates for the states of Utah and New York. The left panel shows estimates at the census-tract level, while the right panel shows estimates at the county level.
tmap_arrange(UT_Diabetes_tract, UT_Diabetes_county, ncol=2, outer.margins = 0.02)
tmap_arrange(NY_Obesity_tract, NY_Obesity_county, ncol=2, outer.margins = 0.02)
tmap_arrange(NY_Diabetes_tract, NY_Diabetes_county, ncol=2, outer.margins = 0.02)
Here is a summary table.
2.2 Bias due to omitted confounders
\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \epsilon_i, \quad \text{for } i = 1, \dots, n\]
where the errors \(\epsilon_i\) are independent and identically distributed (i.i.d.) with \(\epsilon_i \sim N(0, \sigma^2)\).
Let’s assume the following association is true (i.e., a gold standard) without any selection bias, measurement bias, or other unmeasured confounding.
N <- 100000
C <- rnorm(N)
X <- .5 * C + rnorm(N)
Y <- .3 * C + .4 * X + rnorm(N)
2.2.1 Gold standard
With the correct model specification (i.e., \(C\) as a confounder), we get an unbiased estimate of \(X\) on \(Y\).
# Gold standard
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001275 0.003167 0.403 0.687
## X 0.401728 0.003155 127.319 <2e-16 ***
## C 0.305416 0.003545 86.161 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.002752)
##
## Null deviance: 142238 on 99999 degrees of freedom
## Residual deviance: 100272 on 99997 degrees of freedom
## AIC: 284068
##
## Number of Fisher Scoring iterations: 2
2.2.2 Misspecified model: a confounder, \(C\), was omitted from the model
By omitting \(C\), the estimate of \(X\) is biased, either “away from” or “towards” the null.
# C was omitted
glm.biased <- glm(Y~X, family="gaussian")
summary(glm.biased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0006198 0.0032821 0.189 0.85
## X 0.5234680 0.0029241 179.019 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.077185)
##
## Null deviance: 142238 on 99999 degrees of freedom
## Residual deviance: 107716 on 99998 degrees of freedom
## AIC: 291227
##
## Number of Fisher Scoring iterations: 2
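The magnitude of this bias can be checked against the classic omitted-variable bias formula: the slope of `Y ~ X` converges to \(\beta_X + \beta_C \,\mathrm{cov}(X, C)/\mathrm{var}(X) = 0.4 + 0.3 \times 0.5 / 1.25 = 0.52\), which matches the estimate above. A minimal sketch (the seed is an arbitrary choice, used only for reproducibility):

```r
set.seed(2024)  # arbitrary seed, for reproducibility only
N <- 100000
C <- rnorm(N)
X <- .5 * C + rnorm(N)  # var(X) = .5^2 + 1 = 1.25, cov(X, C) = .5
Y <- .3 * C + .4 * X + rnorm(N)

# Omitted-variable bias: slope(Y ~ X) -> beta_X + beta_C * cov(X, C) / var(X)
theoretical <- .4 + .3 * cov(X, C) / var(X)
empirical <- unname(coef(glm(Y ~ X, family = "gaussian"))["X"])
round(c(theoretical = theoretical, empirical = empirical), 3)
```

Both values land near 0.52, confirming that the crude estimate overstates the true effect of 0.4.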
2.2.3 Bias “away from” or “towards” the null?
N <- 100000
C <- rnorm(N)
X <- -.5 * C + rnorm(N)
Y <- -.3 * C + .4 * X + rnorm(N)
# Correct model: C included
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001155 0.003162 0.365 0.715
## X 0.405403 0.003144 128.939 <2e-16 ***
## C -0.297054 0.003532 -84.099 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9997488)
##
## Null deviance: 141635 on 99999 degrees of freedom
## Residual deviance: 99972 on 99997 degrees of freedom
## AIC: 283768
##
## Number of Fisher Scoring iterations: 2
# C was omitted
glm.biased <- glm(Y~X, family="gaussian")
summary(glm.biased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001605 0.003272 0.491 0.624
## X 0.523397 0.002912 179.766 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.070449)
##
## Null deviance: 141635 on 99999 degrees of freedom
## Residual deviance: 107043 on 99998 degrees of freedom
## AIC: 290600
##
## Number of Fisher Scoring iterations: 2
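The direction of the bias depends on the product of the two confounding paths: flipping the sign of both (as above) leaves the bias positive, while flipping only one pushes the estimate towards the null. A sketch using a small helper function (`ovb` is our name, not from the text; the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed
# Fit Y ~ X while omitting C, for given path coefficients C -> X and C -> Y.
# The true effect of X on Y is 0.4 in every scenario.
ovb <- function(b_cx, b_cy, n = 100000) {
  C <- rnorm(n)
  X <- b_cx * C + rnorm(n)
  Y <- b_cy * C + .4 * X + rnorm(n)
  unname(coef(glm(Y ~ X, family = "gaussian"))["X"])
}

round(c(both_positive = ovb(.5, .3),    # biased away from the null (~0.52)
        both_negative = ovb(-.5, -.3),  # same-signed paths: same bias (~0.52)
        mixed_signs   = ovb(.5, -.3)),  # biased towards the null (~0.28)
      2)
```

Only when the two paths have opposite signs is the estimate pulled towards the null; same-signed paths bias it away.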
2.2.4 \(C\) is not a confounder of \(X\) and \(Y\)
N <- 100000
C <- rnorm(N)
X <- rnorm(N)
Y <- .4 * X + rnorm(N)
2.2.5 Correct model specification: Without \(C\)
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007034 0.003166 2.222 0.0263 *
## X 0.404101 0.003176 127.239 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.002215)
##
## Null deviance: 116445 on 99999 degrees of freedom
## Residual deviance: 100220 on 99998 degrees of freedom
## AIC: 284013
##
## Number of Fisher Scoring iterations: 2
2.2.6 Misspecified model with \(C\)
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007039 0.003166 2.223 0.0262 *
## X 0.404118 0.003176 127.243 <2e-16 ***
## C -0.003170 0.003173 -0.999 0.3178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.002215)
##
## Null deviance: 116445 on 99999 degrees of freedom
## Residual deviance: 100219 on 99997 degrees of freedom
## AIC: 284014
##
## Number of Fisher Scoring iterations: 2
2.2.7 \(C\) is a collider of \(X\) and \(Y\)
N <- 100000
X <- rnorm(N)
Y <- .7 * X + rnorm(N)
C <- 1.2 * X + .6 * Y + rnorm(N)
2.2.8 Correct model specification: Without \(C\)
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0003944 0.0031629 -0.125 0.901
## X 0.6966664 0.0031746 219.447 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.000377)
##
## Null deviance: 148211 on 99999 degrees of freedom
## Residual deviance: 100036 on 99998 degrees of freedom
## AIC: 283829
##
## Number of Fisher Scoring iterations: 2
2.2.9 Misspecified model with \(C\)
This is one example of selection bias. For instance, suppose \(X\) is education, \(Y\) is income, and \(C\) is enrollment in a social welfare program. People with lower education (i.e., the high-risk group in terms of exposure) and lower income (i.e., the high-risk group in terms of outcome) are more likely to register for the program. If a survey were conducted only among registered participants, the association estimated from these disproportionately selected respondents would likely be biased.
glm.biased <- glm(Y~X + C, family="gaussian")
summary(glm.biased)
##
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0007675 0.0027135 -0.283 0.777298
## X -0.0164534 0.0046471 -3.541 0.000399 ***
## C 0.4414927 0.0023311 189.391 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7362814)
##
## Null deviance: 148211 on 99999 degrees of freedom
## Residual deviance: 73626 on 99997 degrees of freedom
## AIC: 253178
##
## Number of Fisher Scoring iterations: 2
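The welfare-program story can also be reproduced by explicit selection rather than adjustment: restricting the sample to high values of the collider \(C\) (e.g., surveying only program registrants) biases the crude estimate even without \(C\) in the model. A sketch under the same data-generating process (the median cut-off and seed are arbitrary choices):

```r
set.seed(3)  # arbitrary seed
N <- 100000
X <- rnorm(N)
Y <- .7 * X + rnorm(N)
C <- 1.2 * X + .6 * Y + rnorm(N)

# Survey only the "registered" half of the population (high C)
registered <- C > median(C)
d <- data.frame(X = X, Y = Y)[registered, ]

full_sample <- unname(coef(glm(Y ~ X, family = "gaussian"))["X"])
selected    <- unname(coef(glm(Y ~ X, family = "gaussian", data = d))["X"])
round(c(full_sample = full_sample, selected = selected), 2)
# The selected-sample estimate is attenuated relative to the true 0.7
```

No adjustment for \(C\) appears in either model; conditioning on the collider through sample selection alone is enough to distort the estimate.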
2.3 Overadjustment bias
Please note that this is not a comprehensive example; it reflects only one aspect of potential overadjustment bias.
Let’s assume a model with \(M\) as a mediator.
N <- 100000
X <- rnorm(N)
M <- .5 * X + rnorm(N)
Y <- .3 * X + .4 * M + rnorm(N)
2.4 Total effect
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
##
## Call:
## glm(formula = Y ~ X, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0006419 0.0034065 -0.188 0.851
## X 0.5058361 0.0034073 148.456 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.160389)
##
## Null deviance: 141611 on 99999 degrees of freedom
## Residual deviance: 116037 on 99998 degrees of freedom
## AIC: 298667
##
## Number of Fisher Scoring iterations: 2
2.5 Overadjustment
glm.overadjusted <- glm(Y~X + M, family="gaussian")
summary(glm.overadjusted)
##
## Call:
## glm(formula = Y ~ X + M, family = "gaussian")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002150 0.003164 -0.679 0.497
## X 0.306941 0.003536 86.816 <2e-16 ***
## M 0.398001 0.003154 126.179 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.001023)
##
## Null deviance: 141611 on 99999 degrees of freedom
## Residual deviance: 100099 on 99997 degrees of freedom
## AIC: 283895
##
## Number of Fisher Scoring iterations: 2
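For linear models, the discrepancy between the two fits is exactly the mediated path: total effect = direct effect + \(a \times b\), where \(a\) is the \(X \to M\) path and \(b\) is the \(M \to Y\) path, i.e., \(0.3 + 0.5 \times 0.4 = 0.5\). A sketch checking the decomposition (seed is arbitrary):

```r
set.seed(5)  # arbitrary seed
N <- 100000
X <- rnorm(N)
M <- .5 * X + rnorm(N)
Y <- .3 * X + .4 * M + rnorm(N)

a      <- unname(coef(glm(M ~ X, family = "gaussian"))["X"])  # X -> M path (~0.5)
fit_xm <- glm(Y ~ X + M, family = "gaussian")
b      <- unname(coef(fit_xm)["M"])                           # M -> Y path (~0.4)
direct <- unname(coef(fit_xm)["X"])                           # direct effect (~0.3)
total  <- unname(coef(glm(Y ~ X, family = "gaussian"))["X"])  # total effect (~0.5)

# In OLS this identity holds exactly in-sample:
round(c(total = total, direct_plus_indirect = direct + a * b), 3)
```

So adjusting for \(M\) is not “wrong” per se; it changes the estimand from the total effect to the direct effect, which is overadjustment when the total effect is the target.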
2.6 Logistic models
2.6.1 Sex as a Confounder, \(C\)
MYY <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 1,
freq = 5
)
MYN <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 0,
freq = 8
)
MNY <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 1,
freq = 45
)
MNN <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 0,
freq = 72
)
FYY <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 1,
freq = 25
)
FYN <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 0,
freq = 10
)
FNY <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 1,
freq = 25
)
FNN <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 0,
freq = 10
)
Ex_confounder <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)
Convert the frequency table to raw (individual-level) data:
library(tidyr)
raw_confounder <- Ex_confounder %>%
  uncount(freq)
# Crude model (Sex ignored)
glm.crude <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder)
summary(glm.crude)
##
## Call:
## glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_confounder)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1582 0.1627 -0.972 0.3309
## SmokingYes 0.6690 0.3397 1.970 0.0489 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 277.26 on 199 degrees of freedom
## Residual deviance: 273.28 on 198 degrees of freedom
## AIC: 277.28
##
## Number of Fisher Scoring iterations: 4
- Full model:
glm_logit <- glm(Cancer ~ Smoking + Sex , family=binomial(link = "logit"), data=raw_confounder)
glm_logit
##
## Call: glm(formula = Cancer ~ Smoking + Sex, family = binomial(link = "logit"),
## data = raw_confounder)
##
## Coefficients:
## (Intercept) SmokingYes SexMale
## 9.163e-01 4.266e-15 -1.386e+00
##
## Degrees of Freedom: 199 Total (i.e. Null); 197 Residual
## Null Deviance: 277.3
## Residual Deviance: 257 AIC: 263
- Stratified models:
## For males
raw_confounder_M <- raw_confounder[ which(raw_confounder$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_M)
glm_logit_m
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_confounder_M)
##
## Coefficients:
## (Intercept) SmokingYes
## -4.700e-01 6.672e-16
##
## Degrees of Freedom: 129 Total (i.e. Null); 128 Residual
## Null Deviance: 173.2
## Residual Deviance: 173.2 AIC: 177.2
# For females
raw_confounder_F <- raw_confounder[ which(raw_confounder$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_F)
glm_logit_f
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_confounder_F)
##
## Coefficients:
## (Intercept) SmokingYes
## 9.163e-01 9.400e-16
##
## Degrees of Freedom: 69 Total (i.e. Null); 68 Residual
## Null Deviance: 83.76
## Residual Deviance: 83.76 AIC: 87.76
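Exponentiating the coefficients makes the confounding concrete: the crude model gives OR \(= e^{0.669} \approx 1.95\), while the adjusted and stratified fits give OR \(= 1\) in each sex stratum. The same numbers can be recovered directly from the frequency table (the `odds_ratio` helper is ours, not from the text):

```r
# Frequency table reproduced from the confounder example above
tab <- data.frame(
  Sex     = rep(c("Male", "Female"), each = 4),
  Smoking = rep(c("Yes", "Yes", "No", "No"), times = 2),
  Cancer  = rep(c(1, 0), times = 4),
  freq    = c(5, 8, 45, 72, 25, 10, 25, 10)
)

# 2x2 odds ratio of Cancer by Smoking from a frequency table
odds_ratio <- function(d) {
  yes_case <- sum(d$freq[d$Smoking == "Yes" & d$Cancer == 1])
  yes_ctrl <- sum(d$freq[d$Smoking == "Yes" & d$Cancer == 0])
  no_case  <- sum(d$freq[d$Smoking == "No"  & d$Cancer == 1])
  no_ctrl  <- sum(d$freq[d$Smoking == "No"  & d$Cancer == 0])
  (yes_case / yes_ctrl) / (no_case / no_ctrl)
}

round(c(crude  = odds_ratio(tab),
        male   = odds_ratio(tab[tab$Sex == "Male", ]),
        female = odds_ratio(tab[tab$Sex == "Female", ])), 2)
# crude ~ 1.95, while both stratum-specific ORs are exactly 1
```

The crude OR is entirely an artifact of Sex being associated with both Smoking and Cancer in these data.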
2.6.2 Sex as a Moderator, \(M\)
MYY <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 1,
freq = 5
)
MYN <- data.frame(Sex = "Male",
Smoking = "Yes",
Cancer = 0,
freq = 4
)
MNY <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 1,
freq = 45
)
MNN <- data.frame(Sex = "Male",
Smoking = "No",
Cancer = 0,
freq = 68
)
FYY <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 1,
freq = 25
)
FYN <- data.frame(Sex = "Female",
Smoking = "Yes",
Cancer = 0,
freq = 14
)
FNY <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 1,
freq = 25
)
FNN <- data.frame(Sex = "Female",
Smoking = "No",
Cancer = 0,
freq = 14
)
Ex_moderator <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)
Convert the frequency table to raw (individual-level) data:
library(tidyr)
raw_moderator <- Ex_moderator %>%
  uncount(freq)
- Full model:
glm_logit <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator)
glm_logit
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_moderator)
##
## Coefficients:
## (Intercept) SmokingYes
## -0.1582 0.6690
##
## Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
## Null Deviance: 277.3
## Residual Deviance: 273.3 AIC: 277.3
- Stratified models:
## For males
raw_moderator_M <- raw_moderator[ which(raw_moderator$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_M)
glm_logit_m
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_moderator_M)
##
## Coefficients:
## (Intercept) SmokingYes
## -0.4128 0.6360
##
## Degrees of Freedom: 121 Total (i.e. Null); 120 Residual
## Null Deviance: 165.1
## Residual Deviance: 164.3 AIC: 168.3
# For females
raw_moderator_F <- raw_moderator[ which(raw_moderator$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_F)
glm_logit_f
##
## Call: glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"),
## data = raw_moderator_F)
##
## Coefficients:
## (Intercept) SmokingYes
## 5.798e-01 -2.621e-16
##
## Degrees of Freedom: 77 Total (i.e. Null); 76 Residual
## Null Deviance: 101.8
## Residual Deviance: 101.8 AIC: 105.8
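By contrast, in the moderator example the stratum-specific odds ratios differ by sex (\(e^{0.636} \approx 1.89\) for males, \(1\) for females), so no single adjusted OR summarizes the association. Recomputing from the frequency table (again with a helper of our own, not from the text):

```r
# Frequency table reproduced from the moderator example above
tab_mod <- data.frame(
  Sex     = rep(c("Male", "Female"), each = 4),
  Smoking = rep(c("Yes", "Yes", "No", "No"), times = 2),
  Cancer  = rep(c(1, 0), times = 4),
  freq    = c(5, 4, 45, 68, 25, 14, 25, 14)
)

# 2x2 odds ratio of Cancer by Smoking from a frequency table
odds_ratio <- function(d) {
  yes_case <- sum(d$freq[d$Smoking == "Yes" & d$Cancer == 1])
  yes_ctrl <- sum(d$freq[d$Smoking == "Yes" & d$Cancer == 0])
  no_case  <- sum(d$freq[d$Smoking == "No"  & d$Cancer == 1])
  no_ctrl  <- sum(d$freq[d$Smoking == "No"  & d$Cancer == 0])
  (yes_case / yes_ctrl) / (no_case / no_ctrl)
}

round(c(male   = odds_ratio(tab_mod[tab_mod$Sex == "Male", ]),
        female = odds_ratio(tab_mod[tab_mod$Sex == "Female", ])), 2)
# male ~ 1.89 but female = 1: Sex modifies the Smoking-Cancer association
```

When the stratum-specific estimates diverge like this, reporting the stratified results (or an interaction model) is more informative than a single pooled OR.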